-
Notifications
You must be signed in to change notification settings - Fork 965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tool-based annotation of out of memory errors. #5196
Conversation
1f5b9cb
to
a65c3db
Compare
Hi @jmchilton, this sounds incredibly interesting to me as we cannot use HISAT2 on collections with the wheat genome because more than 2 jobs in the same node make them crash silently. The jobs crash due to "out of memory" error but in the history they appear in green, so you do not realise this until downstream tools fail. Would your solution solve this problem? |
This is fantastic @jmchilton - it will be extremely beneficial to my work, and I'm looking forward to making use of it. Thanks! |
Unfortunately I can't say much to the changes of the code. But from the description and my understanding of the code I definitely think that this PR is a step in the right direction and can be really helpful. My opinion is still that memory errors should be detected by the runner -- as far as possible (and for the other cases your changes are really useful). My reasoning is the following: like run time violations (which are already handled by the runner) also memory violations are not the tool's fault, but a fault of the runner (which requested to little resources or was just unable to ensure that sufficient resources are available). (Similarly also cluster errors that are mentioned in #3320 should be handled by the runner). A second line of reasoning is that the detection of memory errors should/must work for each and every tool. So if we let the tools decide if there is a memory violation one would need to enable the detection of memory violations in every tool. Because I guess (but maybe I'm wrong) that most tools don't check for memory problems, but use features of the programming language or just don't check at all, this would mean that most tools would need to check for the same error messages which means quite a bit of code duplication in the tool xml files. Please note that if I write runner, then this seems the best place for me, but I'm not an expert. The important thing for me is that such problems should not be detected by single tools. So some of the job classes might also be a good place. We have to distinguish programming language specific error messages (e.g. std::bad_alloc in the case of c++ programs) and program specific error messages (e.g. if the programs catch out-of-memory and writing a custom message to stderr/stdout or returning specific error codes). Handling programming language specific error messages by the runner has the potential problem (as you mentioned earlier) that there might be tools where such an error does not indicate an error. The question with this is: does this affect a significant number of tools? It might be easier to implement an exception for these tools (e.g. by filtering the spurious messages from stdout/stderr) compared to adapting all tools to consider language specific error messages. Clearly (as discussed earlier), tool specific memory error messages should be handled by the tool. For the detection of such messages your changes appear to go in this direction, but it seems that so far only exit codes can be used to detect OOM? Or is it possible to specify regular expressions for tool specific So overall my suggestion for this PR is to:
Maybe a bit of background on how I intend to implement job handling on our Galaxy instance (which uses a grid engine on a large cluster). Jobs are resubmitted with increasing run time starting with 15min and then 1h, 1day, 1week. A similar 2 stage escalation strategy is implemented for memory (we have normal nodes and high mem nodes). So with a few job destinations we cover all tools and input sizes -- without wasting to much run time. This of course relies on the detection of runtime and memory violations. All other strategies, eg. dynamic job destinations or specifying time/memory per tool seem impractical to me because they require to much time for administration or waste resources (the same tool started for bacteria and an eukaryote genome would get the same resources). |
This tool has custom <stdio>
<regex level="fatal" match="hisat2-align exited with value 1" source="both" />
<regex level="fatal" match="hisat2: not found" source="both" />
<exit_code range="1:" />
</stdio> <stdio>
<regex level="fatal" match="hisat2-align exited with value 1" source="both" />
<regex level="fatal" match="hisat2: not found" source="both" />
<regex level="fatal_oom" match="out of memory" source="both" />
<exit_code range="1:" />
</stdio> It is weird in general that the HiSat exit code is ignore though - that is a fix that should be made regardless of this change to Galaxy I suspect. I've created an issue on the IUC issue tracker galaxyproject/tools-iuc#1634. |
Great. Then the tool based errors seem to be covered by your PR. Still I would suggest that we move the language dependent errors (like std::bad_alloc) elsewhere. |
a65c3db
to
bb5db5f
Compare
Yeah - I rebased this with one small tweak that enabled this (I thought it already was) and then added some docs for it in
These ones are only used if I've never really liked Would we make
Or maybe
|
You may leave it as it is. I'm just looking for a way that a Galaxy admin can enable this behavior (checking for language specific memory errors) for all tools (regardless of the tool configurations). Maybe job_conf? |
I think this is ok. This entire PR looks great to me. |
I'm completely with you. The PR is a great addition. On job_conf.xml: Having a |
@jmchilton can you please resolve the conflict. Thanks! |
- Update explicit stdio definitions as well as detect_errors-based definitions to allow this. - Update runner code to allow resubmission in such cases.
bb5db5f
to
f017de3
Compare
Thanks @jmchilton! I missed the rebase somehow. What do you think about @bernt-matthias last comment, especially for older tools?
@bernt-matthias I see that this could be useful especially for older tools, but I would also encourage Admins to submit this upstream to the tools. |
Is there a way to document the new functionality at https://docs.galaxyproject.org/en/master/dev/schema.html? |
This should have done that. Our docs just stopped generating/publishing - the build time grew too large I think. I've increased the timeout and re-ran the last such Jenkins job. https://jenkins.galaxyproject.org/view/All/job/latest-Sphinx-Docs/ |
@jmchilton By looking at the code, it seems to me that |
Backend was enhanced to allow this in galaxyproject#5196 and a new entry point to setting exit code for such exit conditions but the custom parsing wasn't updated to allow it. Thanks to @nsoranzo for the catch.
stdio
definitions as well as detect_errors-based definitions to allow this.detect_errors="exit_code"
is set, an extra attributeoom_exit_code
can be defined on the same command block.detect_errors='aggressive"
is set, Galaxy will now look for newer strings indicating common memory problems. (Do we need to restrict this to just newer profile version tools for greater reproducibility or do we want to catch more errors on older tools?).Fixes #3320
Implements #3107